Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS #28411
velonica0 wants to merge 2 commits into
Conversation
Hi @hariharans29
Pull request overview
This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.
Changes:
- Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
- Wires the new kernels into MLAS_PLATFORM initialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV.
- Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.
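For readers unfamiliar with the layout these fast paths target: in NCHWc, channels are split into blocks of BlockSize=16 and the in-block channel becomes the innermost, contiguous dimension, so one vector load fetches one full block. A minimal indexing sketch (illustrative names, not code from this PR):

```cpp
#include <cstddef>

constexpr size_t BlockSize = 16;  // fixed by the NCHWc layout in MLAS

// Hypothetical helper: flat index of logical element (n, c, h, w) in an
// NCHWc tensor, where c splits into a block index and an in-block offset.
size_t NchwcIndex(size_t n, size_t c, size_t h, size_t w,
                  size_t C, size_t H, size_t W) {
    const size_t cb = c / BlockSize;                  // channel block
    const size_t ci = c % BlockSize;                  // channel within block
    const size_t CBlocks = (C + BlockSize - 1) / BlockSize;
    return (((n * CBlocks + cb) * H + h) * W + w) * BlockSize + ci;
}
```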
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/snchwc.cpp | Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size. |
| onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp | New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling. |
| onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp | New RVV 3x3 s1 depthwise CHW kernel implementation for the multiplier-1 path. |
| onnxruntime/core/mlas/lib/platform.cpp | Registers RVV NCHWc conv/pool kernels and sets NCHWc block size for riscv64 when RVV is present. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to MLAS_PLATFORM. |
| onnxruntime/core/mlas/lib/convolve.cpp | Enables the depthwise-direct algorithm path on RISCV64 and updates stride restrictions comment/logic. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasConvAlgorithmDepthwise in the enum for RISCV64 builds. |
| cmake/onnxruntime_mlas.cmake | Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV. |
Both comments point to the same concern: "What if …". In practice, however, this is not an issue, because the minimum VLEN on current RISC-V CPUs is 128 bits, so this case is effectively covered (though I have implemented the suggested changes regardless). The calculation is as follows: with e32 and LMUL=4, a vector register group holds (VLEN / 32) * 4 floats, so even at the minimum VLEN of 128 that is (128 / 32) * 4 = 16 lanes, exactly one BlockSize=16 block.
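A minimal runnable check of that guarantee (a sketch, assuming only the standard RVV intrinsics; not code from this PR):

```cpp
#include <riscv_vector.h>
#include <cassert>
#include <cstddef>

int main() {
    // vsetvl returns min(requested AVL, VLMAX). At e32 with LMUL=4,
    // VLMAX = (VLEN / 32) * 4, so the minimum VLEN of 128 already
    // yields VLMAX = 16 and the request below always succeeds.
    size_t vl = __riscv_vsetvl_e32m4(16);
    assert(vl == 16);  // one full NCHWc block per iteration
    return 0;
}
```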
Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot.
| #include "kleidiai/mlasi_kleidiai.h" | ||
| #endif | ||
|
|
||
| #if defined(MLAS_USE_RVV) |
| acc = __riscv_vfadd_vv_f32m4(acc, old_output, vl); | ||
| } | ||
|
|
||
    if (KernelFlags & MLAS_CONV_KERNEL_FLAG_BIAS_ADDITION) {

    // Running sum for average pooling; with ExcludePad, valid_count
    // tracks the per-lane divisor (number of non-padding elements).
    vfloat32m4_t sum_vec = __riscv_vfmv_v_f_f32m4(0.0f, vl);
    uint32_t valid_count[BlockSize];

    if (ExcludePad) {
        for (size_t i = 0; i < BlockSize; i++) {
            valid_count[i] = 0;
        }
    }

    float results[BlockSize];
    __riscv_vse32_v_f32m4(results, sum_vec, vl);
    for (size_t i = 0; i < BlockSize; i++) {
        results[i] /= static_cast<float>(valid_count[i]);
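To make the ExcludePad bookkeeping in the excerpt above concrete, here is a scalar sketch of the same idea for a stride-1 window (illustrative names, not the PR's kernel): out-of-bounds taps contribute nothing, and the divisor is either the full kernel size or the count of in-bounds taps.

```cpp
#include <cstddef>

// Average over one pooling window at output position (oh, ow).
float AveragePoolWindow(const float* input, size_t H, size_t W,
                        size_t oh, size_t ow, size_t KH, size_t KW,
                        size_t pad_top, size_t pad_left, bool ExcludePad) {
    float sum = 0.0f;
    size_t valid = 0;
    for (size_t kh = 0; kh < KH; kh++) {
        for (size_t kw = 0; kw < KW; kw++) {
            const ptrdiff_t ih = static_cast<ptrdiff_t>(oh + kh) - static_cast<ptrdiff_t>(pad_top);
            const ptrdiff_t iw = static_cast<ptrdiff_t>(ow + kw) - static_cast<ptrdiff_t>(pad_left);
            if (ih >= 0 && ih < static_cast<ptrdiff_t>(H) &&
                iw >= 0 && iw < static_cast<ptrdiff_t>(W)) {
                sum += input[static_cast<size_t>(ih) * W + static_cast<size_t>(iw)];
                valid++;  // only in-bounds taps count toward the exclude-pad divisor
            }
        }
    }
    const size_t divisor = ExcludePad ? valid : KH * KW;
    return divisor ? sum / static_cast<float>(divisor) : 0.0f;
}
```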
    const size_t pad_top = Parameters->Padding[0];
    const size_t pad_left = Parameters->Padding[1];
    const size_t pad_right = Parameters->Padding[3];

    // Left edge: output columns whose 3x3 window overlaps the left
    // padding are handled by the scalar edge helper.
    if (pad_left && ow < out_cols) {
        const ptrdiff_t iw = static_cast<ptrdiff_t>(ow) - static_cast<ptrdiff_t>(pad_left);
        float acc = DepthwiseComputeEdge(
            row0, row1, row2, iw, W,
            w00, w01, w02, w10, w11, w12, w20, w21, w22
        );
        if (accumulate_output) {
            acc += beta * out_row[ow];
        }
        out_row[ow++] = acc;
    }

    if (pad_right && ow < out_cols) {
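DepthwiseComputeEdge itself is defined in the PR; as a rough sketch of what a 3x3, stride-1 edge tap has to do (assumed shape, with an illustrative array-of-taps signature): input columns outside [0, W) are skipped, which implements zero padding at the borders.

```cpp
#include <cstddef>

// Illustrative only: iw is the (possibly negative) input column of the
// window's leftmost tap; row pointers are assumed valid, i.e. top/bottom
// padding is handled by the caller.
float DepthwiseEdgeSketch(const float* row0, const float* row1, const float* row2,
                          ptrdiff_t iw, size_t W, const float w[3][3]) {
    float acc = 0.0f;
    for (int kw = 0; kw < 3; kw++) {
        const ptrdiff_t col = iw + kw;
        if (col < 0 || col >= static_cast<ptrdiff_t>(W)) {
            continue;  // out-of-bounds column: zero padding contributes nothing
        }
        acc += w[0][kw] * row0[col] + w[1][kw] * row1[col] + w[2][kw] * row2[col];
    }
    return acc;
}
```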
    for (size_t kh = 0; kh < KernelHeight; kh++) {
        for (size_t kw = 0; kw < KernelWidth; kw++) {
            const float* input_base = Input + output_idx * StrideWidthElements +
                                      kh * DilatedInputWidthElements + kw * DilationWidthElements;

            // Row bounds used below to treat out-of-row accesses as zero padding.
            const float* input_row_start = InputBase + kh * DilatedInputWidthElements;
            const float* input_row_end = input_row_start + InputWidthElements;

            size_t kernel_pos = kh * KernelWidth + kw;

            for (size_t ic = 0; ic < BlockSize; ic++) {
                const float* input_element = input_base + ic;

                float input_value = 0.0f;
                if (input_element >= input_row_start && input_element < input_row_end) {
                    input_value = *input_element;
                }

                // Each kernel tap stores a BlockSize x BlockSize filter tile;
                // broadcast the input scalar against BlockSize output channels.
                size_t filter_offset = kernel_pos * BlockSize * BlockSize + ic * BlockSize;
                vfloat32m4_t filt = __riscv_vle32_v_f32m4(&filter[filter_offset], vl);
                acc = __riscv_vfmacc_vf_f32m4(acc, input_value, filt, vl);
            }
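For context on the filter_offset arithmetic in the loop above: the blocked filter appears to store, per kernel tap, a BlockSize x BlockSize (input channel x output channel) tile with output channels contiguous, which is why a single input scalar can be fused-multiply-accumulated against a 16-wide weight vector. A scalar sketch of that offset math (illustrative, not the PR's code):

```cpp
#include <cstddef>

constexpr size_t BlockSize = 16;

// Offset of weight (kh, kw, ic, oc): taps outermost, then input channel,
// then a contiguous run of BlockSize output channels.
size_t FilterOffset(size_t kh, size_t kw, size_t KernelWidth,
                    size_t ic, size_t oc) {
    const size_t kernel_pos = kh * KernelWidth + kw;
    return kernel_pos * BlockSize * BlockSize + ic * BlockSize + oc;
}
```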
Copilot generated a lot more comments in this round. Can you please take a look?
Description
New kernel files:
- onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp (direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc max/average pooling)
- onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp (3x3 stride-1 depthwise CHW kernel for the multiplier-1 path)
Motivation and Context
Following #28261, this PR optimizes more MLAS kernels using the RISC-V Vector (RVV) extension.
Please Note:
On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group, but we only request 16, so we are using half of the available vector width.
The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON); changing it to 32 would require a different NCHWc format and is not a localized change.
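A small probe that illustrates the half-width usage described above (a sketch, not part of the PR):

```cpp
#include <riscv_vector.h>
#include <cstdio>
#include <cstddef>

int main() {
    // Capacity of one e32/LMUL=4 vector register group on this hardware.
    size_t vlmax = __riscv_vsetvlmax_e32m4();  // 32 when VLEN = 256
    // The NCHWc kernels still request one 16-element block per iteration,
    // because BlockSize = 16 is fixed framework-wide by the layout.
    size_t vl = __riscv_vsetvl_e32m4(16);      // 16
    std::printf("VLMAX = %zu, requested vl = %zu\n", vlmax, vl);
    return 0;
}
```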
Benchmark (SpacemiT K3, VLEN=256, 8-core)
All tests pass with zero numerical error.